
    Sparse Stochastic Bandits

    In the classical multi-armed bandit problem, d arms are available to the decision maker, who pulls them sequentially in order to maximize their cumulative reward. Guarantees can be obtained on a relative quantity called regret, which scales linearly with d (or with sqrt(d) in the minimax sense). Here we consider the sparse case of this classical problem, in the sense that only a small number of arms, namely s < d, have a positive expected reward. We leverage this additional assumption to provide an algorithm whose regret scales with s instead of d. Moreover, we prove that this algorithm is optimal, by providing a matching lower bound (at least for a wide and pertinent range of parameters that we determine) and by evaluating its performance on simulated data.
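    The sparsity-adaptive algorithm itself is not reproduced here, but the setting is easy to simulate. Below is a minimal sketch of a standard UCB1 baseline run on a sparse Bernoulli instance in which only s of the d arms have a positive mean; the arm means, the horizon, and the Bernoulli reward model are illustrative assumptions, not taken from the paper.

```python
# A minimal UCB1 baseline on a sparse Bernoulli bandit instance.
# Everything here (arm means, horizon, reward model) is an illustrative
# assumption; this is NOT the sparsity-adaptive algorithm of the paper.
import math
import random

def ucb1(arm_means, horizon):
    d = len(arm_means)
    counts = [0] * d
    sums = [0.0] * d
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= d:
            arm = t - 1                      # pull every arm once to initialize
        else:                                # optimism in the face of uncertainty
            arm = max(range(d),
                      key=lambda a: sums[a] / counts[a]
                                    + math.sqrt(2.0 * math.log(t) / counts[a]))
        reward = 1.0 if random.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return total

# Sparse instance: only s = 2 of the d = 20 arms have a positive mean.
means = [0.3, 0.2] + [0.0] * 18
print(ucb1(means, horizon=10_000))
```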

    Stochastic Bandit Models for Delayed Conversions

    Online advertising and product recommendation are important domains of application for multi-armed bandit methods. In these fields, the reward that is immediately available is most often only a proxy for the actual outcome of interest, which we refer to as a conversion. For instance, in web advertising, clicks can be observed within a few seconds after an ad display, but the corresponding sale, if any, will take hours, if not days, to happen. This paper proposes and investigates a new stochastic multi-armed bandit model in the framework proposed by Chapelle (2014), based on empirical studies in the field of web advertising, in which each action may trigger a future reward that will then happen with a stochastic delay. We assume that the probability of conversion associated with each action is unknown while the distribution of the conversion delay is known, distinguishing between the (idealized) case where the conversion events may be observed whatever their delay and the more realistic setting in which late conversions are censored. We provide performance lower bounds as well as two simple but efficient algorithms based on the UCB and KLUCB frameworks. The latter algorithm, which is preferable when conversion rates are low, is based on a Poissonization argument, of independent interest in other settings where aggregation of Bernoulli observations with different success probabilities is required. Comment: Conference on Uncertainty in Artificial Intelligence, Aug 2017, Sydney, Australia.
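    As an illustration of the censored-conversion setting, the sketch below simulates a UCB-style index in which observed conversions are divided by the expected fraction of conversions that should already be visible under a known (here, geometric) delay distribution. This only captures the spirit of a delay-corrected estimator; the exact estimators, the KLUCB variant, and the Poissonization argument of the paper are not reproduced, and all numerical parameters are made up.

```python
# A simplified delay-corrected UCB index for censored conversions.
# The delay distribution is assumed known and geometric(p); observed
# conversions are divided by the expected number that should already be
# visible.  All parameters are illustrative assumptions.
import math
import random

def delay_cdf(age, p):
    """P(delay <= age) for a geometric(p) delay measured in rounds."""
    return 1.0 - (1.0 - p) ** max(age, 0)

def delayed_ucb(conversion_probs, horizon, p=0.3):
    K = len(conversion_probs)
    pull_times = [[] for _ in range(K)]   # rounds at which each arm was pulled
    pending = []                          # (arm, round at which its conversion is revealed)
    conversions = [0] * K
    for t in range(horizon):
        # reveal conversions whose delay has elapsed
        due = [e for e in pending if e[1] <= t]
        pending = [e for e in pending if e[1] > t]
        for arm, _ in due:
            conversions[arm] += 1
        # optimistic index: delay-corrected mean + exploration bonus
        def index(a):
            if not pull_times[a]:
                return float("inf")       # force one initial pull of each arm
            visible = sum(delay_cdf(t - s, p) for s in pull_times[a])
            mean = conversions[a] / max(visible, 1e-9)
            return mean + math.sqrt(2.0 * math.log(t + 1) / len(pull_times[a]))
        arm = max(range(K), key=index)
        pull_times[arm].append(t)
        if random.random() < conversion_probs[arm]:
            delay = 0                     # sample a geometric(p) conversion delay
            while random.random() >= p:
                delay += 1
            pending.append((arm, t + 1 + delay))
    return conversions

print(delayed_ucb([0.05, 0.02, 0.01], horizon=20_000))
```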

    Multiple-Play Bandits in the Position-Based Model

    Sequentially learning to place items in multi-position displays or lists is a task that can be cast into the multiple-play semi-bandit setting. However, a major concern in this context is that the system cannot decide whether the user feedback for each item is actually exploitable: much of the content may simply have been ignored by the user. The present work proposes to exploit available information regarding the display position bias under the so-called Position-Based Model (PBM). We first discuss how this model differs from the Cascade model and its variants considered in several recent works on multiple-play bandits. We then provide a novel regret lower bound for this model, as well as computationally efficient algorithms that display good empirical and theoretical performance.
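    A minimal sketch of the PBM setting follows, assuming the position examination probabilities kappa are known and that click probabilities factorize as kappa[slot] * theta[item]. The optimistic list selection shown (highest UCB indices placed in the most examined slots) illustrates the general approach, not the specific algorithms analysed in the paper, and all numbers are invented.

```python
# A sketch of the Position-Based Model (PBM): the click probability of
# item k displayed in slot l factorizes as kappa[l] * theta[k].  The
# examination probabilities kappa are assumed known; all values are
# illustrative and this is not the paper's algorithm.
import math
import random

def pbm_bandit(theta, kappa, horizon):
    K, L = len(theta), len(kappa)
    weight = [0.0] * K          # examination-weighted number of displays
    clicks = [0.0] * K
    for t in range(1, horizon + 1):
        # optimistic index per item; best items go to the most examined slots
        ucb = []
        for k in range(K):
            if weight[k] == 0.0:
                ucb.append(float("inf"))     # display each item at least once
            else:
                ucb.append(clicks[k] / weight[k]
                           + math.sqrt(2.0 * math.log(t) / weight[k]))
        ranking = sorted(range(K), key=lambda k: -ucb[k])[:L]
        for slot, k in enumerate(ranking):
            weight[k] += kappa[slot]
            if random.random() < kappa[slot] * theta[k]:
                clicks[k] += 1.0
    return [clicks[k] / weight[k] if weight[k] else 0.0 for k in range(K)]

# The returned ratios are consistent estimates of theta under the PBM.
print(pbm_bandit(theta=[0.4, 0.3, 0.2, 0.1], kappa=[0.9, 0.6, 0.3], horizon=50_000))
```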

    Beyond Average Return in Markov Decision Processes

    What are the functionals of the reward that can be computed and optimized exactly in Markov Decision Processes? In the finite-horizon, undiscounted setting, Dynamic Programming (DP) can only handle these operations efficiently for certain classes of statistics. We summarize the characterization of these classes for policy evaluation and give a new answer for the planning problem. Interestingly, we prove that only generalized means can be optimized exactly, even in the more general framework of Distributional Reinforcement Learning (DistRL). DistRL does, however, make it possible to evaluate other functionals approximately. We provide error bounds on the resulting estimators and discuss the potential of this approach as well as its limitations. These results contribute to advancing the theory of Markov Decision Processes by examining overall characteristics of the return, and in particular risk-conscious strategies. Comment: NeurIPS 2023, Dec 2023, New Orleans, United States.
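    As a concrete instance of a generalized mean that finite-horizon DP handles exactly, the toy sketch below optimizes the entropic risk (exponential utility) of the return, (1/beta) * log E[exp(beta * R)], via a multiplicative backward recursion. The two-state, two-action MDP, the value of beta, and the horizon are hypothetical and serve only to illustrate the idea; they are not taken from the paper.

```python
# Toy illustration: the entropic risk (1/beta) * log E[exp(beta * return)],
# an exponential-utility generalized mean, can be optimized exactly by a
# multiplicative backward recursion (assumes beta > 0).
import math

# transitions[s][a] = list of (probability, next_state, reward)  -- hypothetical MDP
transitions = [
    [[(0.8, 0, 1.0), (0.2, 1, 0.0)], [(0.5, 0, 2.0), (0.5, 1, -1.0)]],
    [[(1.0, 1, 0.5)],                [(0.6, 0, 1.5), (0.4, 1, 0.0)]],
]

def optimal_entropic_return(beta, horizon):
    """Backward recursion on W[s] = max_a E[exp(beta * (r + G'))]."""
    W = [1.0] * len(transitions)                    # exp(beta * 0) at the horizon
    for _ in range(horizon):
        W = [max(sum(p * math.exp(beta * r) * W[s2] for p, s2, r in transitions[s][a])
                 for a in range(len(transitions[s])))
             for s in range(len(transitions))]
    return [math.log(w) / beta for w in W]          # back to reward units

print(optimal_entropic_return(beta=0.5, horizon=10))
```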

    Modèle de distraction pour la sélection séquentielle de contenu (A distraction model for sequential content selection)

    National audience. In the context of online marketing, the advertisements presented to users are often ranked: the best-placed ones capture the user's attention and receive more clicks, independently of their own content. To sequentially build a campaign that collects many clicks without prior information on the quality of the items, one must therefore be able to learn the best ordered list of L among the K products available in the catalogue. Each time a list is presented to the visitor, they click on some of the products, and this is the only information sent back to the system. In the sequential learning setting, the system must then update its estimators in order to propose a potentially better list to the next visitor. The drawback of existing methods for this problem lies in their models: they neglect the user's relative inattention, which leads to underestimating the click probabilities of the displayed products and to possible flaws in exploration. We therefore propose a way to include this aspect in an original multi-armed bandit model. After carefully studying the impact of user distraction on the asymptotic performance of the algorithms, we exploit the principle of optimism in the face of uncertainty to propose a series of efficient algorithms that we evaluate experimentally.
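    A toy illustration (not the model of the paper) of the underestimation effect mentioned above: if the user ignores the whole list with some probability q and otherwise clicks an item with probability theta_k, the naive click-rate estimator converges to (1 - q) * theta_k rather than theta_k; dividing by (1 - q), when q is known, removes the bias. The values of theta_k, q, and the sample size are made up.

```python
# Toy illustration of the bias induced by user inattention: the naive
# click-rate estimate converges to (1 - q) * theta_k, and dividing by
# (1 - q) corrects it when q is known.  All parameters are hypothetical.
import random

def simulate(theta_k=0.3, q=0.4, n=200_000):
    clicks = sum(1 for _ in range(n)
                 if random.random() >= q          # user pays attention
                 and random.random() < theta_k)   # and clicks the item
    naive = clicks / n
    corrected = naive / (1.0 - q)
    return naive, corrected

print(simulate())   # roughly (0.18, 0.30)
```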